Persistent RNNs: Stashing Recurrent Weights On-Chip
Authors
Abstract
This paper introduces a framework for mapping Recurrent Neural Network (RNN) architectures efficiently onto parallel processors such as GPUs. Key to our approach is the use of persistent computational kernels that exploit the processor’s memory hierarchy to reuse network weights over multiple timesteps. Using our framework, we show how it is possible to achieve substantially higher computational throughput at lower mini-batch sizes than direct implementations of RNNs based on matrix multiplications. Our initial implementation achieves 2.8 TFLOP/s at a mini-batch size of 4 on an NVIDIA TitanX GPU, which is about 45% of theoretical peak throughput, and is 30x faster than a standard RNN implementation based on optimized GEMM kernels at this batch size. Reducing the batch size from 64 to 4 per processor provides a 16x reduction in activation memory footprint, enables strong scaling to 16x more GPUs using data-parallelism, and allows us to efficiently explore end-to-end speech recognition models with up to 108 residual RNN layers.
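To make the persistent-kernel idea concrete, the CUDA sketch below keeps a small recurrent weight matrix resident in shared memory and reuses it at every timestep, instead of re-reading it from off-chip memory for each timestep's matrix multiply. This is an illustration under simplifying assumptions, not the paper's implementation: the hidden size, batch size, single-block launch, and use of __syncthreads() as the timestep barrier are chosen for brevity, whereas the paper's kernel stashes weights in registers and synchronizes the whole grid.

```cuda
#include <cuda_runtime.h>

constexpr int HIDDEN = 64;  // small enough that the weight tile fits in shared memory (assumed)
constexpr int BATCH  = 4;   // per-GPU mini-batch size, matching the abstract

// h <- relu(W * h + x_t) for t = 0..timesteps-1, with W resident on-chip.
// Illustrative launch: persistent_rnn_toy<<<1, BATCH * HIDDEN>>>(dW, dX, dH, T);
__global__ void persistent_rnn_toy(const float* __restrict__ W,  // HIDDEN x HIDDEN
                                   const float* __restrict__ x,  // T x BATCH x HIDDEN
                                   float* __restrict__ h,        // BATCH x HIDDEN, in/out
                                   int timesteps) {
    __shared__ float Ws[HIDDEN][HIDDEN];    // recurrent weights, fetched from DRAM exactly once
    __shared__ float hs[2][BATCH][HIDDEN];  // double-buffered hidden state

    const int tid = threadIdx.x;            // one thread per (batch sample, hidden unit) pair
    const int b = tid / HIDDEN;
    const int i = tid % HIDDEN;

    for (int k = tid; k < HIDDEN * HIDDEN; k += blockDim.x)
        Ws[k / HIDDEN][k % HIDDEN] = W[k];
    hs[0][b][i] = h[b * HIDDEN + i];
    __syncthreads();

    for (int t = 0; t < timesteps; ++t) {
        const int cur = t & 1, nxt = cur ^ 1;
        float acc = x[(t * BATCH + b) * HIDDEN + i];
        for (int j = 0; j < HIDDEN; ++j)
            acc += Ws[i][j] * hs[cur][b][j];  // weights read from shared memory, not DRAM
        hs[nxt][b][i] = fmaxf(acc, 0.0f);
        __syncthreads();                      // per-timestep barrier (grid-wide sync in a multi-block kernel)
    }
    h[b * HIDDEN + i] = hs[timesteps & 1][b][i];
}
```

At a mini-batch of 4 the per-timestep work is far too small to amortize re-loading the weights from DRAM on every step, which is exactly the regime the persistent formulation targets.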
Similar resources
Persistent RNNs: Stashing Recurrent Weights On-Chip
This paper introduces a new technique for mapping Deep Recurrent Neural Networks (RNNs) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU’s inverted memory ...
DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks
Recently, deep learning with convolutional neural networks (CNNs) and recurrent neural networks (RNNs) has become universal in all-around applications. CNNs are used to support vision recognition and processing, and RNNs are able to recognize time varying entities and to support generative models. Also, combining both CNNs and RNNs can recognize time varying visual entities, such as action and ...
Sparse Persistent RNNs: Squeezing Large Recurrent Networks On-Chip
Recurrent Neural Networks (RNNs) are powerful tools for solving sequence-based problems, but their efficacy and execution time are dependent on the size of the network. Following recent work in simplifying these networks with model pruning and a novel mapping of work onto GPUs, we design an efficient implementation for sparse RNNs. We investigate several optimizations and tradeoffs: Lamport tim...
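A rough sketch of how pruning and on-chip weight reuse could combine is given below; this is a toy under assumed names (vals, cols), a fixed nonzero count per row, and a batch of one, not the paper's kernel. Each thread keeps the surviving weights of one hidden unit's row in registers and reuses them across all timesteps.

```cuda
constexpr int H = 64;            // hidden units (assumed)
constexpr int NNZ_PER_ROW = 8;   // nonzeros kept per row after pruning (assumed)

// Illustrative launch: sparse_persistent_rnn_toy<<<1, H>>>(dVals, dCols, dX, dH, T);
__global__ void sparse_persistent_rnn_toy(const float* __restrict__ vals, // H x NNZ_PER_ROW pruned weights
                                          const int*   __restrict__ cols, // H x NNZ_PER_ROW column indices
                                          const float* __restrict__ x,    // T x H inputs (batch of 1)
                                          float*       __restrict__ h,    // H hidden state, in/out
                                          int timesteps) {
    const int i = threadIdx.x;          // one thread per hidden unit
    float w[NNZ_PER_ROW];               // this row's surviving weights stay in registers
    int   c[NNZ_PER_ROW];
    for (int k = 0; k < NNZ_PER_ROW; ++k) {
        w[k] = vals[i * NNZ_PER_ROW + k];
        c[k] = cols[i * NNZ_PER_ROW + k];
    }

    __shared__ float hs[H];
    hs[i] = h[i];
    __syncthreads();

    for (int t = 0; t < timesteps; ++t) {
        float acc = x[t * H + i];
        for (int k = 0; k < NNZ_PER_ROW; ++k)
            acc += w[k] * hs[c[k]];     // sparse dot product; weights never re-read from DRAM
        __syncthreads();                // all reads of the old state finish before it is overwritten
        hs[i] = fmaxf(acc, 0.0f);
        __syncthreads();
    }
    h[i] = hs[i];
}
```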
Block-Sparse Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are used in state-of-the-art models in domains such as speech recognition, machine translation, and language modelling. Sparsity is a technique to reduce compute and memory requirements of deep learning models. Sparse RNNs are easier to deploy on devices and high-end server processors. Even though sparse operations need less compute and memory relative to their ...
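As a concrete, hypothetical instance of the saving, the sketch below multiplies a block-sparse matrix in a BSR-style layout by a dense vector; the block size, array names, and layout are assumptions rather than the paper's code. Zero blocks are never stored or touched, so both memory traffic and compute scale with the number of surviving blocks.

```cuda
constexpr int BS = 16;  // square block size (assumed)

// y = A * x for a block-sparse A stored as dense BS x BS blocks plus CSR-style block indices.
// Illustrative launch: bsr_matvec<<<(block_rows * BS + 255) / 256, 256>>>(...);
__global__ void bsr_matvec(const float* __restrict__ blocks,  // nnzb x BS x BS nonzero blocks
                           const int*   __restrict__ row_ptr, // block_rows + 1 block-row offsets
                           const int*   __restrict__ col_idx, // block-column index of each block
                           const float* __restrict__ x,
                           float*       __restrict__ y,
                           int block_rows) {
    const int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per output row
    if (row >= block_rows * BS) return;
    const int brow = row / BS, r = row % BS;

    float acc = 0.0f;
    for (int b = row_ptr[brow]; b < row_ptr[brow + 1]; ++b) {
        const float* blk = blocks + b * BS * BS;   // only nonzero blocks exist in memory
        const int col0 = col_idx[b] * BS;
        for (int col = 0; col < BS; ++col)
            acc += blk[r * BS + col] * x[col0 + col];
    }
    y[row] = acc;
}
```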
Fast Weight Long Short-term Memory
Associative memory using fast weights is a short-term memory mechanism that substantially improves the memory capacity and time scale of recurrent neural networks (RNNs). As recent studies introduced fast weights only to regular RNNs, it is unknown whether fast weight memory is beneficial to gated RNNs. In this work, we report a significant synergy between long short-term memory (LSTM) networks...
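For reference, the fast-weight mechanism this snippet builds on is, in the formulation of Ba et al. (2016), an outer-product memory updated at every step; a minimal statement of that update (notation assumed, not taken from this paper) is

$$A(t) = \lambda\, A(t-1) + \eta\, h(t)\, h(t)^{\top},$$

where $\lambda$ is a decay rate, $\eta$ a fast learning rate, and $h(t)$ the hidden state; the resulting fast memory $A(t)$ is then applied to the next hidden state alongside the slow recurrent weights.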
Publication date: 2016